Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encapsulate View logic for GenericByteViewArray #5619

Closed
wants to merge 2 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Apr 9, 2024

Draft PR as I would love to know if anyone has any comments about this idea / approach

Note this looks like a large PR but it is mostly comments and tests

Which issue does this PR close?

Part of #5374

Rationale for this change

While reviewing #5557 from @ariesdevil -- it seemed like the 12 byte view length was getting hard coded many places and the same code is repeated over and over and makes the code harder to work with.

While encoding the number 12 isn't a huge deal, what is a larger deal is the cognative overhead of working with ByteViewArrays while having to remember in your head the layouts of what the u128 represents. Adding an abstraction makes this easier in my opinion

What changes are included in this PR?

Encapsulate view manipulation in typesafe wrappers over u128

  1. Add a View enum that captures which view is used
  2. Add an OwnedView and OwnedViewBuilder for creating Views
  3. Add InlineView and OffsetView
  4. Rewrite some of the code to use the new View

Are there any user-facing changes?

Several new types, but all existing types still pass

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 9, 2024
arrow-data/src/transform/mod.rs Outdated Show resolved Hide resolved
@alamb alamb changed the title WIP: Encapsulate View logic more WIP: Encapsulate Short/Long View logic in StringViewArray Apr 9, 2024
@alamb alamb changed the title WIP: Encapsulate Short/Long View logic in StringViewArray Encapsulate View logic for GenericByteViewArray May 6, 2024
///
/// ```text
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to View

if !flushed.is_empty() {
assert!(self.completed.len() < u32::MAX as usize);
self.completed.push(flushed.into());
let view: u128 = match OwnedView::from(v) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the construction is now much clearer and encapsulated -- the bit manipulation is handled by OwnedView and OffsetViewBuilder

}
impl<'a> From<&'a u128> for View<'a> {
#[inline(always)]
fn from(v: &'a u128) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the key API for a View -- dispatching to the correct variant

}
}

/// A view for data where the variable length data has 12 or fewer bytes. See
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this PR is merged, I will make a follow on PR to remove the remaining uses of ByteView and deprecate this struct

@@ -178,13 +178,17 @@ fn build_extend_view(array: &ArrayData, buffer_offset: u32) -> Extend {
mutable
.buffer1
.extend(views[start..start + len].iter().map(|v| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another example of how the magic 12 number gets extracted out

/// # Notes
/// Equality is based on the bitwise value of the view, not the data it logically points to
#[derive(Debug, Copy, Clone, PartialEq)]
pub enum View<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need into_owned that return OwnedView?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, will add it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fe510ef

@alamb
Copy link
Contributor Author

alamb commented May 6, 2024

CI is failing due to #5725 and #5719

But otherwise I think this PR is ready for review

@alamb alamb marked this pull request as ready for review May 6, 2024 13:59
Copy link
Contributor

@ariesdevil ariesdevil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clear and neat

@alamb
Copy link
Contributor Author

alamb commented May 6, 2024

Thanks for the review @ariesdevil

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a brief look at this and I think this could be made a fair bit simpler, in particular I don't really understand the need for separate borrowed/owning variants when u128 is copy.

However, that does lead to my biggest concern, that the formulation using an enumeration forces a branch where it isn't strictly necessary. For example, prefix comparison can be done without needing to check the length at all. This makes me wonder if just a simple ByteView(u128) would suffice to encapsulate the short-string logic, or something...

I personally would feel more comfortable introducing such abstractions after we have a decent set of kernels with accompanying benchmarks, so we an empirically reason about the performance implications of such a change, along with having examples to inform the API design. As it stands this feels like a touch premature, given it really just avoids encoding the number 12 in various places 😅

Edit: I created #5735 which might provide a simpler way to achieve the same end

/// # Notes
/// Equality is based on the bitwise value of the view, not the data it logically points to
#[derive(PartialEq)]
pub enum OwnedView {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this type adding, it feels like it isn't entirely necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its usecase is to create a view from &str / &[u8] and copy the relevant prefix bytes and remember which variant (inline or offset) the new view was during creation

/// # Notes
/// Equality is based on the bitwise value of the view, not the data it logically points to
#[derive(Debug, Copy, Clone, PartialEq)]
pub enum View<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be ByteView to avoid confusion with the list view types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already another struct named ByteView in this file, so in order to avoid an API change I didn't reuse the name. If we don't care about API changes we could remove the existing ByteView

Comment on lines +245 to +246
pub fn as_u128(self) -> &'a u128 {
self.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn as_u128(self) -> &'a u128 {
self.0
pub fn as_u128(self) -> u128 {
*self.0

/// Equality is based on the bitwise value of the view, not the data it
/// logically points to
#[derive(Copy, Clone, PartialEq)]
pub struct InlineView<'a>(&'a u128);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we even need this borrow, u128 is copy and removing the indirection might help LLVM not be stupid

@tustvold tustvold mentioned this pull request May 8, 2024
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a brief look at this and I think this could be made a fair bit simpler, in particular I don't really understand the need for separate borrowed/owning variants when u128 is copy.

That was to encapsulate copying the inlined bytes and retain the information about which type it was (without having to check the length again)

However, that does lead to my biggest concern, that the formulation using an enumeration forces a branch where it isn't strictly necessary. For example, prefix comparison can be done without needing to check the length at all.

I don't understand this assertion. In my mind the whole point of the View abstraction is to encapsulate the check for length as the two different variants need different handling. For special cases like prefix comparsion we could certainly add specialized functions like View::compare_prefix

I personally would feel more comfortable introducing such abstractions after we have a decent set of kernels with accompanying benchmarks, so we an empirically reason about the performance implications of such a change, along with having examples to inform the API design. As it stands this feels like a touch premature, given it really just avoids encoding the number 12 in various places 😅

I think the kernels and usage patterns we have are sufficient. There are several other uses of ByteView that I didn't port over in this PR to keep it smaller (such as

pub(super) fn byte_view_equal(
lhs: &ArrayData,
rhs: &ArrayData,
lhs_start: usize,
rhs_start: usize,
len: usize,
) -> bool {
let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
let lhs_buffers = &lhs.buffers()[1..];
let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
let rhs_buffers = &rhs.buffers()[1..];
for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
// Only checking one null mask here because by the time the control flow reaches
// this point, the equality of the two masks would have already been verified.
if lhs.is_null(idx) {
continue;
}
let l_len_prefix = *l as u64;
let r_len_prefix = *r as u64;
// short-circuit, check length and prefix
if l_len_prefix != r_len_prefix {
return false;
}
let len = l_len_prefix as u32;
// for inline storage, only need check view
if len <= 12 {
if l != r {
return false;
}
continue;
}
// check buffers
let l_view = ByteView::from(*l);
let r_view = ByteView::from(*r);
let l_buffer = &lhs_buffers[l_view.buffer_index as usize];
let r_buffer = &rhs_buffers[r_view.buffer_index as usize];
// prefixes are already known to be equal; skip checking them
let len = len as usize - 4;
let l_offset = l_view.offset as usize + 4;
let r_offset = r_view.offset as usize + 4;
if l_buffer[l_offset..l_offset + len] != r_buffer[r_offset..r_offset + len] {
return false;
}
}
true
) but I can do so to show how it would work

Edit: I created #5735 which might provide a simpler way to achieve the same end

I don't understand how this solves the same problem but I probably don't fully understand it

/// # Notes
/// Equality is based on the bitwise value of the view, not the data it logically points to
#[derive(Debug, Copy, Clone, PartialEq)]
pub enum View<'a> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already another struct named ByteView in this file, so in order to avoid an API change I didn't reuse the name. If we don't care about API changes we could remove the existing ByteView

/// # Notes
/// Equality is based on the bitwise value of the view, not the data it logically points to
#[derive(PartialEq)]
pub enum OwnedView {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its usecase is to create a view from &str / &[u8] and copy the relevant prefix bytes and remember which variant (inline or offset) the new view was during creation

@alamb
Copy link
Contributor Author

alamb commented May 8, 2024

In order to move this PR forward, I plan to do the following:

  1. Run (and create if necessary) benchmarks for operations
  2. Remove the remaining uses of ByteView in favor of this new View abstraction to see what it looks like

@alamb
Copy link
Contributor Author

alamb commented May 8, 2024

As it stands this feels like a touch premature, given it really just avoids encoding the number 12 in various places 😅

I updated the description of this PR to make the rationale clearer.

While encoding the number 12 isn't a huge deal in my mind, the core rationale for this PR does is to reduce the cognative overhead of working with ByteViewArrays. Without an abstraction such as View, you have to remember what the u128 represents and both its variants.

While some people have the necessary time to invest understanding the lowest level representation, I think it is a barrier to contribution (as well as using the library). I have seen evidence of this challenge in a few examples such as #5557 to build up the u128 from the parquet data pages as well as #5707 (comment) in #5707.

While there might be other explanations, I think an abstraction like this will make ByteViewArrays much easier to work with

@tustvold
Copy link
Contributor

Marking as a draft pending #5736

@tustvold tustvold marked this pull request as draft May 16, 2024 10:28
@alamb
Copy link
Contributor Author

alamb commented May 29, 2024

I believe we are taking a different approach -- see #5736 and #5796

@alamb alamb closed this May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants